Combining Data in pandas With merge(), .join(), and concat() – Real Python

2024-07-17 14:20| 来源: 网络整理| 查看: 265

Examples

Many pandas tutorials provide very simple DataFrames to illustrate the concepts that they are trying to explain. This approach can be confusing since you can’t relate the data to anything concrete. So, for this tutorial, you’ll use two real-world datasets as the DataFrames to be merged:

Climate normals for California (temperatures) Climate normals for California (precipitation)

You can explore these datasets and follow along with the examples below using the interactive Jupyter Notebook and climate data CSVs:

Download the notebook and data set: Click here to get the Jupyter Notebook and CSV data set you’ll use to learn about Pandas merge(), .join(), and concat() in this tutorial.

If you’d like to learn how to use Jupyter Notebooks, then check out Jupyter Notebook: An Introduction.

These two datasets are from the National Oceanic and Atmospheric Administration (NOAA) and were derived from the NOAA public data repository. First, load the datasets into separate DataFrames:

Python >>> import pandas as pd >>> climate_temp = pd.read_csv("climate_temp.csv") >>> climate_precip = pd.read_csv("climate_precip.csv") Copied!

In the code above, you used pandas’ read_csv() to conveniently load your source CSV files into DataFrame objects. You can then look at the headers and first few rows of the loaded DataFrames with .head():

Python >>> climate_temp.head() STATION STATION_NAME ... DLY-HTDD-BASE60 DLY-HTDD-NORMAL 0 GHCND:USC00049099 TWENTYNINE PALMS CA US ... 10 15 1 GHCND:USC00049099 TWENTYNINE PALMS CA US ... 10 15 2 GHCND:USC00049099 TWENTYNINE PALMS CA US ... 10 15 3 GHCND:USC00049099 TWENTYNINE PALMS CA US ... 10 15 4 GHCND:USC00049099 TWENTYNINE PALMS CA US ... 10 15 >>> climate_precip.head() STATION ... DLY-SNOW-PCTALL-GE050TI 0 GHCND:USC00049099 ... -9999 1 GHCND:USC00049099 ... -9999 2 GHCND:USC00049099 ... -9999 3 GHCND:USC00049099 ... 0 4 GHCND:USC00049099 ... 0 Copied!

Here, you used .head() to get the first five rows of each DataFrame. Make sure to try this on your own, either with the interactive Jupyter Notebook or in your console, so that you can explore the data in greater depth.

Next, take a quick look at the dimensions of the two DataFrames:

Python >>> climate_temp.shape (127020, 21) >>> climate_precip.shape (151110, 29) Copied!

Note that .shape is a property of DataFrame objects that tells you the dimensions of the DataFrame. For climate_temp, the output of .shape says that the DataFrame has 127,020 rows and 21 columns.

Inner Join

In this example, you’ll use merge() with its default arguments, which will result in an inner join. Remember that in an inner join, you’ll lose rows that don’t have a match in the other DataFrame’s key column.

With the two datasets loaded into DataFrame objects, you’ll select a small slice of the precipitation dataset and then use a plain merge() call to do an inner join. This will result in a smaller, more focused dataset:

Python >>> precip_one_station = climate_precip.query("STATION == 'GHCND:USC00045721'") >>> precip_one_station.head() STATION ... DLY-SNOW-PCTALL-GE050TI 1460 GHCND:USC00045721 ... -9999 1461 GHCND:USC00045721 ... -9999 1462 GHCND:USC00045721 ... -9999 1463 GHCND:USC00045721 ... -9999 1464 GHCND:USC00045721 ... -9999 Copied!

Here you’ve created a new DataFrame called precip_one_station from the climate_precip DataFrame, selecting only rows in which the STATION field is "GHCND:USC00045721".

If you check the shape attribute, then you’ll see that it has 365 rows. When you do the merge, how many rows do you think you’ll get in the merged DataFrame? Remember that you’ll be doing an inner join:

Python >>> inner_merged = pd.merge(precip_one_station, climate_temp) >>> inner_merged.head() STATION STATION_NAME ... DLY-HTDD-BASE60 DLY-HTDD-NORMAL 0 GHCND:USC00045721 MITCHELL CAVERNS CA US ... 14 19 1 GHCND:USC00045721 MITCHELL CAVERNS CA US ... 14 19 2 GHCND:USC00045721 MITCHELL CAVERNS CA US ... 14 19 3 GHCND:USC00045721 MITCHELL CAVERNS CA US ... 14 19 4 GHCND:USC00045721 MITCHELL CAVERNS CA US ... 14 19 >>> inner_merged.shape (365, 47) Copied!

If you guessed 365 rows, then you were correct! This is because merge() defaults to an inner join, and an inner join will discard only those rows that don’t match. Because all of your rows had a match, none were lost. You should also notice that there are many more columns now: 47 to be exact.

With merge(), you also have control over which column(s) to join on. Let’s say that you want to merge both entire datasets, but only on Station and Date since the combination of the two will yield a unique value for each row. To do so, you can use the on parameter:

Python >>> inner_merged_total = pd.merge( ... climate_temp, climate_precip, on=["STATION", "DATE"] ... ) >>> inner_merged_total.shape (123005, 48) Copied!

You can specify a single key column with a string or multiple key columns with a list. This results in a DataFrame with 123,005 rows and 48 columns.

Why 48 columns instead of 47? Because you specified the key columns to join on, pandas doesn’t try to merge all mergeable columns. This can result in “duplicate” column names, which may or may not have different values.

“Duplicate” is in quotation marks because the column names will not be an exact match. By default, they are appended with _x and _y. You can also use the suffixes parameter to control what’s appended to the column names.

To prevent surprises, all the following examples will use the on parameter to specify the column or columns on which to join.

Outer Join

Here, you’ll specify an outer join with the how parameter. Remember from the diagrams above that in an outer join—also known as a full outer join—all rows from both DataFrames will be present in the new DataFrame.

If a row doesn’t have a match in the other DataFrame based on the key column(s), then you won’t lose the row like you would with an inner join. Instead, the row will be in the merged DataFrame, with NaN values filled in where appropriate.

This is best illustrated in an example:

Python >>> outer_merged = pd.merge( ... precip_one_station, climate_temp, how="outer", on=["STATION", "DATE"] ... ) >>> outer_merged.shape (127020, 48) Copied!

If you remember from when you checked the .shape attribute of climate_temp, then you’ll see that the number of rows in outer_merged is the same. With an outer join, you can expect to have the same number of rows as the larger DataFrame. That’s because no rows are lost in an outer join, even when they don’t have a match in the other DataFrame.

Left Join

In this example, you’ll specify a left join—also known as a left outer join—with the how parameter. Using a left outer join will leave your new merged DataFrame with all rows from the left DataFrame, while discarding rows from the right DataFrame that don’t have a match in the key column of the left DataFrame.

You can think of this as a half-outer, half-inner merge. The example below shows you this in action:

Python >>> left_merged = pd.merge( ... climate_temp, precip_one_station, how="left", on=["STATION", "DATE"] ... ) >>> left_merged.shape (127020, 48) Copied!

left_merged has 127,020 rows, matching the number of rows in the left DataFrame, climate_temp. To prove that this only holds for the left DataFrame, run the same code, but change the position of precip_one_station and climate_temp:

Python >>> left_merged_reversed = pd.merge( ... precip_one_station, climate_temp, how="left", on=["STATION", "DATE"] ... ) >>> left_merged_reversed.shape (365, 48) Copied!

This results in a DataFrame with 365 rows, matching the number of rows in precip_one_station.

Right Join

The right join, or right outer join, is the mirror-image version of the left join. With this join, all rows from the right DataFrame will be retained, while rows in the left DataFrame without a match in the key column of the right DataFrame will be discarded.

To demonstrate how right and left joins are mirror images of each other, in the example below you’ll recreate the left_merged DataFrame from above, only this time using a right join:

Python >>> right_merged = pd.merge( ... precip_one_station, climate_temp, how="right", on=["STATION", "DATE"] ... ) >>> right_merged.shape (127020, 48) Copied!

Here, you simply flipped the positions of the input DataFrames and specified a right join. When you inspect right_merged, you might notice that it’s not exactly the same as left_merged. The only difference between the two is the order of the columns: the first input’s columns will always be the first in the newly formed DataFrame.

merge() is the most complex of the pandas data combination tools. It’s also the foundation on which the other tools are built. Its complexity is its greatest strength, allowing you to combine datasets in every which way and to generate new insights into your data.

On the other hand, this complexity makes merge() difficult to use without an intuitive grasp of set theory and database operations. In this section, you’ve learned about the various data merging techniques, as well as many-to-one and many-to-many merges, which ultimately come from set theory. For more information on set theory, check out Sets in Python.

Now, you’ll look at .join(), a simplified version of merge().

Remove ads

【本文地址】

公司简介

联系我们

今日新闻

推荐新闻

专题文章